
[WIP]backend: Integrating QNN (Qualcomm AI Engine Direct) as a dedicated backend for Qualcomm NPUs #12063


Draft
chraac wants to merge 205 commits into master

Conversation


@chraac commented Feb 25, 2025

Warning: This is an early draft of my fork and will continue to be updated to meet the requirements in the contributing guidelines.

Summary

This fork is based on zhouwg's initial PR and carries out further refactoring and improvements to introduce Qualcomm QNN backend support in GGML.

This backend is organized into three distinct integration layers:

graph TB
    subgraph GGML Adaptation Layer
        A1[Graph Caching, Mapping, and Execution]
        A2[Tensor Binding and Execution Flow]
    end

    subgraph QNN Object Layer
        B1[QNN System and Instance Management]
        B2[Dynamic Resource Handling]
    end

    subgraph Utility Layer
        C1[Dynamic Library Loading & Search Path Management]
        C2[General Utilities]
    end

    %% Relations to illustrate stack dependency
    A1 -->|Uses| B1
    A2 -->|Uses| B1
    B1 -->|Relies on| C1
  1. GGML Adaptation Layer

    • Graph Caching, Mapping, and Execution:

      • Provides a robust mechanism to map a GGML computation graph into a corresponding QNN graph, allowing efficient offloading of operations to the QNN accelerator.
      • Implements graph caching strategies (in backend-ops.cpp) to minimize redundant graph creation and boost execution performance (a simplified sketch of the caching idea appears after this list).
      • Seamlessly translates GGML operations into corresponding QNN op objects using specialized op constructors and configuration functions (configured in op-config-caps.cpp and op-config-impl.cpp).
    • Tensor Binding and Execution Flow:

      • Adapts GGML tensor objects to the QNN backend (see tensor.hpp and graph.hpp), managing both host and RPC memory via buffer interfaces like qnn_buffer_interface.
      • Ensures proper data flow between GGML graphs and QNN execution contexts through carefully handled tensor binding/unbinding procedures.
  2. QNN Object Layer

    • QNN System and Instance Management:

      • Encapsulates the QNN system via the qnn_system_interface class, originally derived from executorch, to create and free the QNN system context.
      • Manages QNN instance creation and initialization via the qnn_instance class.
      • Implements backend loading routines (e.g., load_backend() and load_system()) that retrieve provider lists and choose valid QNN interfaces based on API version checks.
      • Uses caching mechanisms for loaded backends and tracks library handles to guarantee proper cleanup during finalization.
    • Dynamic Resource Handling:

      • Integrates fallback mechanisms in load_lib_with_fallback() to reliably load both the system and RPC libraries.
      • Manages RPC memory allocation and deallocation via function pointer resolution from the loaded RPC library.
  3. Utility Layer

    • Dynamic Library Loading & Search Path Management:

      • Implements functions in qnn-lib.cpp to manage dynamic library loading with fallbacks.
      • Uses helper routines such as insert_path() and set_qnn_lib_search_path() to configure environment variables (like LD_LIBRARY_PATH on Linux and ADSP_LIBRARY_PATH on Android) based on a custom library search path.
    • General Utilities:

      • Provides detailed error and debug logging through QNN logging macros.
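
As a rough illustration of the caching strategy described above (not the actual implementation in backend-ops.cpp), a cache can key QNN graphs on a signature derived from the GGML graph topology, so repeated evaluations of the same graph reuse a previously built QNN graph. The names qnn_graph, build_qnn_graph, and make_graph_key below are hypothetical, and the sketch assumes the graph node list is visible to backend code via ggml-impl.h, as it is for the other ggml backends.

```cpp
// Hypothetical sketch: reuse a previously built QNN graph when a GGML graph
// with the same topology is evaluated again, instead of rebuilding it.
#include <memory>
#include <string>
#include <unordered_map>

#include "ggml.h"
#include "ggml-impl.h"  // ggml_cgraph definition used by backend code

struct qnn_graph;  // stand-in for a wrapper around a built QNN graph handle

// Build a key from op types and tensor shapes; identical keys mean the same
// QNN graph can be reused.
static std::string make_graph_key(const ggml_cgraph * cgraph) {
    std::string key;
    for (int i = 0; i < cgraph->n_nodes; ++i) {
        const ggml_tensor * node = cgraph->nodes[i];
        key += ggml_op_name(node->op);
        for (int d = 0; d < GGML_MAX_DIMS; ++d) {
            key += '_';
            key += std::to_string(node->ne[d]);
        }
        key += ';';
    }
    return key;
}

struct qnn_graph_cache {
    std::unordered_map<std::string, std::shared_ptr<qnn_graph>> graphs;

    // Hypothetical builder that lowers the GGML graph into a QNN graph.
    std::shared_ptr<qnn_graph> build_qnn_graph(const ggml_cgraph * cgraph);

    std::shared_ptr<qnn_graph> get_or_build(const ggml_cgraph * cgraph) {
        const std::string key = make_graph_key(cgraph);
        auto it = graphs.find(key);
        if (it != graphs.end()) {
            return it->second;  // cache hit: skip graph re-creation
        }
        auto graph = build_qnn_graph(cgraph);
        graphs.emplace(key, graph);
        return graph;
    }
};
```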

Key Features and Improvements

  • Graph Mapping Mechanism:

    • Efficient mapping of GGML graphs into QNN graphs is a standout feature, enabling the offloading and execution of computation graphs on hardware accelerators (see graph.hpp and backend-ops.cpp).
    • Graph caching strategies help reuse QNN graphs to reduce redundancy and enhance performance.
    • The translation of GGML operations into corresponding QNN ops supports various data types and parameter configurations (a simplified op-mapping sketch appears after this list).
  • Backend Context and Device Management:

    • Comprehensive QNN instance initialization supports API negotiation, enhanced error handling, and detailed device property logging.
    • Detailed logs (chipset description, HTP architecture, VTCM memory size) facilitate debugging and performance tuning.
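
To make the op-translation point above concrete, the sketch below shows the general shape of an op-capability table like the one in op-config-caps.cpp: a GGML op either maps to a QNN op name or is rejected so that ggml schedules it on another backend. The QNN op name strings and the support check are placeholders, not the PR's actual tables.

```cpp
// Placeholder mapping in the spirit of op-config-caps.cpp: GGML ops with a
// QNN counterpart return its op name, everything else is left to other
// backends. The QNN op name strings below are illustrative only.
#include "ggml.h"

static const char * to_qnn_op_name(enum ggml_op op) {
    switch (op) {
        case GGML_OP_ADD:     return "ElementWiseAdd";       // placeholder
        case GGML_OP_MUL:     return "ElementWiseMultiply";  // placeholder
        case GGML_OP_MUL_MAT: return "MatMul";               // placeholder
        default:              return nullptr;                // not offloaded
    }
}

// Simplified support check: the op must have a QNN counterpart and use a
// data type the backend currently handles (float types in this sketch).
static bool device_supports_op(const ggml_tensor * op) {
    if (to_qnn_op_name(op->op) == nullptr) {
        return false;
    }
    return op->type == GGML_TYPE_F32 || op->type == GGML_TYPE_F16;
}
```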

Build

For build instructions, please refer to this page.

Testing

  • Basic functionality of the QNN backend has been verified on Android, Linux, and Windows using test-backend-ops; this check runs in the pipeline for each commit of the dev-refactoring branch.

    | Platform | test-backend-ops | Full console output |
    | --- | --- | --- |
    | Android | 2ac8fce111ee0047a5a8b43808047ff2 (screenshot) | test-backend-ops_all_android_ff033e1.log |
    | Linux | (screenshot) | test-backend-ops_all_linux_ff033e1.log |
    | Windows | To be filled | To be filled |
  • Proper graph creation and execution paths are confirmed through detailed log messages.

  • Memory registration and cleanup within tensor binding functions have been thoroughly checked.

  • The table below shows GIFs of the QNN backend running on different platforms:

    | Platform | SoC | Model | GIF | Original video |
    | --- | --- | --- | --- | --- |
    | Android | 8 Gen 2 | llama-3-8B-Instruct-Q4_K_M | Recording_Muted_hevc_14_126_640 | Recording_Muted_hevc.mp4 |
    | Windows | To be filled | | | |

Current state

  • The test-backend-ops suite passes on all platforms, including support for both qnn-npu and qnn-gpu devices.
  • Testing with llama3.2-1b/3b-f16/32 models yields expected results.
  • Quantized matrix multiplication is under development; for quantized models, the CPU backend may be used as a fallback.

Future development

  • Further feature support and device-specific optimizations are planned (see also the project backlog).
  • Future iterations will add support for quantized data types, with efforts underway to map GGML's block quantization structure into QNN (the Q4_0 block layout sketched below illustrates what needs to be represented).
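
For reference, the sketch below reproduces ggml's Q4_0 block layout (as defined in ggml-common.h): 32 weights share one fp16 scale, and each weight is stored as a 4-bit value packed two per byte. Whether QNN consumes this directly through a custom op or through an unpacked representation is the open design question mentioned above.

```cpp
// ggml's Q4_0 block layout (see ggml-common.h): one fp16 scale per block of
// 32 weights, each weight a 4-bit value packed two per byte.
#include <stdint.h>

#define QK4_0 32

typedef uint16_t ggml_half;  // fp16 storage type used by ggml

typedef struct {
    ggml_half d;              // per-block scale
    uint8_t   qs[QK4_0 / 2];  // 16 bytes holding 32 x 4-bit quants
} block_q4_0;

// Reference dequantization: weight[i] = d * (q[i] - 8), where q[i] is the
// unsigned 4-bit value (0..15) unpacked from qs.
```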

zhou.weiguo and others added 30 commits April 24, 2024 16:28
auto old_mode = SetErrorMode(SEM_FAILCRITICALERRORS);
SetErrorMode(old_mode | SEM_FAILCRITICALERRORS);

auto handle = LoadLibraryA(lib_path.c_str()); // TODO: use wstring version for unicode paths
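
One possible way to address the TODO in the last quoted line, sketched here as a suggestion rather than code from this PR: convert the path (assumed to be UTF-8) to UTF-16 and call LoadLibraryW, so paths containing non-ANSI characters load correctly.

```cpp
// Suggested direction for the TODO above (not part of the PR): load through
// the wide-character API so non-ANSI paths work.
#include <string>
#include <windows.h>

static HMODULE load_library_utf8(const std::string & lib_path) {
    // Assumes lib_path is UTF-8 encoded.
    const int len = MultiByteToWideChar(CP_UTF8, 0, lib_path.c_str(), -1, nullptr, 0);
    if (len <= 0) {
        return nullptr;
    }
    std::wstring wide_path(static_cast<size_t>(len), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, lib_path.c_str(), -1, &wide_path[0], len);
    return LoadLibraryW(wide_path.c_str());
}
```
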
@chraac (Author) commented Mar 22, 2025


Hi @slaren, I noticed we have similar dynamic library loading functionality in ggml-backend-reg.cpp (the dl_load_library function) that could be useful in other parts of the codebase.
I suggest moving this to a common utility module so we can reuse it across the project. This would help reduce code duplication and provide a consistent approach to loading libraries.
I'd be happy to prepare another PR about that, WDYT?

A project member replied:


Sorry, I missed this. I think that this code is small enough that it is not really a problem if it is duplicated in a backend, and making it part of the public API available to backends may make it harder to change it in the future. So at the moment my preference would be to avoid this.
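
For context, the functionality under discussion is roughly the following kind of thin wrapper over dlopen/LoadLibrary. This is an illustrative sketch only, not the actual dl_load_library in ggml-backend-reg.cpp nor the QNN backend's loader.

```cpp
// Illustrative sketch of the duplicated functionality: a thin cross-platform
// dynamic-library loader (not the actual ggml or QNN backend code).
#ifdef _WIN32
#    include <windows.h>
using dl_handle = HMODULE;
static dl_handle dl_load(const char * path) {
    return LoadLibraryA(path);
}
static void * dl_sym(dl_handle handle, const char * name) {
    return reinterpret_cast<void *>(GetProcAddress(handle, name));
}
#else
#    include <dlfcn.h>
using dl_handle = void *;
static dl_handle dl_load(const char * path) {
    return dlopen(path, RTLD_NOW | RTLD_LOCAL);
}
static void * dl_sym(dl_handle handle, const char * name) {
    return dlsym(handle, name);
}
#endif
```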

chraac added 4 commits April 3, 2025 23:57
* add op define xml

* copy qnn libs in cmake

* fix htp skel path

* add windows copy file list

* wip

* add generated package

* remove unused params

* add cmake list

* set qnn sdk and hexagon sdk path

* wip

* wip

* fix tools version

* fix compiling error

* fix dims calc

* wip

* add mulmat 2d

* wip

* reduction

* wip

* wip

* fix compiling error in x64

* wip

* fix device description in emulator

* wip

* add flag

* copy necessary libs

* wip

* load HtpPrepare first for emulator

* enable custom op for 2d matrix

* verify op config before add to node

* Revert "verify op config before add to node"

This reverts commit 206dec8.

* wip

* wip

* wip

* revert tool version change

* use hexagon sdk version 5.5.0

https://docs.qualcomm.com/bundle/publicresource/topics/80-77512-2/release-notes-wrapper.html?product=1601111740010422#5.5.0

* wip

* move to sub dir

* add hexagon npu device and server lib

* fix npu lib build

* refactoring: rename QNNBackend enum

* fix compiling error

* wip

* remove qnn/backend.hpp

* add hexagon dsp host layer

* extract rpc_mem from qnn submodule

* fix dsp compiling error

* wip

* wip

* open and close npu device

* split objects into separated files

* fix linking error

* add npu_tensor

* add host graph

* map rpc buffer before usage

* fix some todos

* add shared module

* split rpc_interface from rpc_mem

* get get_dsp_arch from device

* wip

* rename host classes

* fix hexagon sdk arch getter

* fix device open

* fix linking error

* fix crash

* use tensor_data_type

* fix npu lib crash

* fix debug log print

* skip empty graph

* wip

* add log

* fix unmap fail

* fix tensor set

* remove some logs

* flush back memory after finished

* fix nb

* wip

* wip

* add helper function

* impl add op

* fix some add in test-backend-ops

* add elt wise sub and mul

* fix crash on some inplace op

* wip

* fix elt wise op calc

* wip

* split mul_mat into file

* add caps array

* wip

* wip

* print support/unsupport op

* copy lldb-server for newer android sdk

* add tensor_spec

* add assert

* fix crash when loading model

* rename cmake option

* fix name

* fix device memory and description

* fix compiling error on qnn only build

* fix some potential UBs

* fix comments
chraac added 24 commits April 24, 2025 21:33
* add qurt_thread

* add thread pool

* add thread_pool obj at device ctx

* wip

* small refactoring to fit the thread pool structure

* set start/end threads for add

* init thread pool

* fix thread creation

* split complete and pending signals

* opt mulmat

* wip

* 2 threads

* back to 4 threads

* use barrier

* remove some unnecessary package

* add multi thread support for mul mat

* wip

* use qurt_barrier_t instead of qurt_signal_t

* wip

* wip

* add log

* split qnn cmake config

* create function to calculate the start and end func

* wip

* fix comment

* fix comment

* fix comment

* wip

* fix typo
* add f16 support to elt wise op

* wip

* Revert "wip"

This reverts commit efa88deb0e8265614fd91db3c3dba777c00e858b.

* qf32 for mul

* wip

* Revert "wip"

This reverts commit bb419f89ca4599470d61d636fe6fa1e033d62748.

* disable fp16 add/sub

* template trick

* wip

* add f16 mulmat

* add log

* fix view liked op

* add log

* fix f16 mulmat

* add quant type

* wip

* add l2fetch

* add vtcm_mem

* wip

* fix fetch

* use vtcm cache in mulmat

* revert vtcm cache

* cache plane

* small opt for plane cache

* cache plane for some element wise op

* wip

* enable fetch even on vtcm

* wip

* copy sysMonApp

* small opt

* init ltu

* add compute_params

* add op common header

* move vtcm_mem allocation to compute_param

* fallback to memcache when vtcm allocate failed

* pre-calculate quantize type

* wip

* try fix test failure

* try fix mulmat nan

* fix inf in mulmat

* remove debug logs

* wip

* small refactoring on the dequant row func

* fix typo

* improve logging

* add q4_0 and q8_0

* wip

* wip

* build hexagon libs in cmake

* wip

* fix qnn only build flag

* fix typo

* fix todo

* wip

* wip

* add to_float

* use to_float directly instead of ltu

* wip

* cache f16_to_f32 table into vtcm

* print tensor dims at log

* init device in supports_op_impl

* revert cache ltu

* wip

* wip

* fix graph calc issues by validate cache manually after each op

* add cache invalidate func

* enable cache fallback only in quantize tensors

* add option to disable quantized tensors

* propagate the asan flag to npu build

* fix asan option

* wip

* invalidate tensors after finished

* implement backend_buffer_reset

* wip

* wip

* refactoring plane cache mechanism

* wip

* split row elements across thread

* use table for f16 to f32 conversion

* sync after each op

* small refactoring to invalidate l2 cache

* wip

* opt on float fetching

* unroll for loop manually

* reduce vtcm usage

* add perf tracking for npu

* print dimensions for profiler log

* wip

* wip

* wip

* add sub proc tracker

* fix typo

* print pcycles

* wip

* wip

* prefetch rows

* add l2fetch_row

* small tweak based on perf tracer

* opt l2 fetching

* wip
* wip

* refactor: rewrite dequantize_row_q4_0 by intrinsic

* log for debug

* fix q4 intrinsic

* small opt

* wip

* wip

* add vtcm_quota_size

* add perf log for hexagon-npu backend

* wip

* add log

* sync after a specific op

* increase worker thread priority

* fix unbalanced thread slice

* small slice to fit in vtcm cache

* limit the supported row element size

* opt 4_0 dequant

* fix q4 dequant

* add power_utils

* add rms_norm

* wip

* enable rms_norm f32

* fix rms_norm with param

* fix compiling flags

* use float

* fix small row size

* vectorized rms norm

* wip

* read 2 vectors

* rename

* add perf log on update

* set empty tensors handle also

* merge some rpc functions

* opt param update

* wip

* print more log

* add struct for update param config

* add npu_device_graph_set_tensor_with_param

* merge tensor and params update

* wip

* wip

* make as template to reuse

* vectorize dequantize_row_q8_0

* opt

* avoid using union to store q data

* wip

* wip

* wip
* add flash attn op

* expand src tensor size

* add flash attn sources

* add quantize row functions

* make a separated file for vec_dot

* wip

* wip

* refactor: rename quants.hpp includes and add vec_dot to type traits

* add flash_attn impl

* split vec_scale_f32

* move vec_reduction_qf32 to vec_ops

* add vec_scale_f16

* opt

* add vec_mad

* implement vec_mad_f16

* opt

* add op template

* opt

* add align version

* enable flash attn

* wip

* log print improve

* add profiler log

* wip

* wip

* add multi sub proc perf tracker

* increase log buffer

* remove sub proc pcycle

* wip

* wip

* add prefetch for vec_dot

* wip

* wip

* opt f16 vec dot

* opt f16 vecdot

* reuse vec_dot_product_impl in vec dot f32

* small opt to unblock pipeline

* opt on aligned address

wip

* Revert "opt on aligned address"

This reverts commit 27be1eb.

* add profiler log at thread_pool

* wip

* invalidate all...

* Reapply "opt on aligned address"

This reverts commit f075a4c.

* add is_constant for tensor config

* disable align tensor opt in mul_mat

* wip

* wip

* vec_scale_impl: unrolling the loop

* wip

* wip

* replace reinterpret_cast with direct pointer access for write/read buffers

* add fetch

* wip

* wip

* wip

* add log

* check tensor shape at flash_attn

* wip

* wip

* fix: update tensor type handling in flash_attn_impl

* wip

* fix: align cache size

* fix: qf16->hf

* fix: swap order of elements in vector combine for correct scaling

* fix: opt f16 scale and mad

* fix leftover fetch

* wip

* load into vector pair

* opt cache size calculation in flash_attn_impl

* refactoring: hold vtcm at thread local object

* wip

* add profiler log

* mark tensors as modified

* restrict tensor invalidation to the first thread in compute_impl

* Revert "restrict tensor invalidation to the first thread in compute_impl"

This reverts commit 0a8ff2b.

* invalidate last tensor in compute_impl

* invalidate last tensor in compute function

* wip

* refactor dequantize_row_q4_0 to simplify vector alignment

* wip

* refactoring: move VTCM quota calculation to thread pool

* wip

* fix: correct condition check for HEXAGON_SDK_ROOT existence

* wip

* wip

* wip

* wip

* fix: update condition checks match the naming

* fix: improve tensor handling checks and logging in graph and operation implementations

* wip
* feat: add mixed precision dot product implementation and function declaration

* feat: implement mixed precision vector dot product and conversion functions

* fix: update data type handling in matrix multiplication implementation

* fix: adjust row count handling in matrix multiplication implementation for accurate slicing

* fix: optimize matrix multiplication implementation by unroll loop

* update performance tracking for matrix multiplication implementation

* add fetching

* wip

* fix: support F16 * F32 multiplication in is_mul_mat_supported function

* fix: improve src0 fetching logic in vec_dot_product_mixed_impl for better alignment handling

* fix test failure for row width 67

* try fix failed test

* fix: rename aligned_address to align_down for clarity in vector alignment handling

* wip

* qnn fix: update device capabilities for quantized types in qnn-lib to improve compatibility

* fix test failure at width == 193

* fix: replace zero vector initialization with previous vector in mixed dot product implementation

* wip

* fix: improve handling of last vector in mixed dot product implementation

* wip

* wip

* wip

* wip

* Enhance mul_mat_f32 function to support quantized types and improve static assertions

* rename

* Refactor dequantization functions to use npu_device_fp16_t and improve type handling

* Optimize dequantization in dequantize_row_q8_0 by replacing qf32 multiplication with qf16

* Optimize dequantization in dequantize_row_q4_0 by replacing qf32 multiplication with qf16

* Add hvx_vsf_convert_vhf function for improved vector conversion

* add perf logs

* Refactor dequantize_row_q4_0 for alignment

* Update logging in supports_op_impl and supports_op to use ggml_op_desc for better clarity

* Add support for ROPE operation in NPU capabilities and related functions

* Implement ROPE operation in tensor and op_rope, including cache initialization and correction dimension calculations

* enable ROPE by adding operation validation

* add support to freq is null case

* wip

* Refactor rope_f32 to improve indexing by introducing total_planes calculation

* reformat

* Refactor rope_f32 to optimize data access patterns by introducing row and plane pointers

* Add performance tracking to rope_f32 function for enhanced profiling

* Refactor rope_f32 to use a templated implementation

* Refactor rope_impl to replace loop with memcpy for improved performance

* Refactor mul_mat_impl to support quantization as a template parameter

* wip

* wip

* Refactor rope_impl to optimize plane indexing in the processing loop

* Add aligned vector dot product implementation for mixed precision types

* wip

* Enhance matrix multiplication for F32 and F16 types with alignment checks

* Optimize vec_dot_product_mix_aligned_impl for improved performance with additional vector sums

* Add alignment checks for matrix multiplication and vector dot products

* Refactor matrix multiplication to use function pointers for improved readability and maintainability

* Fix alignment check in is_dot_product_aligned to ensure correct vector size handling

* Remove unused f16_to_f32_table parameter from quantization and dequantization functions

* wip

* Add L2 fetch for src1 plane rows in matrix multiplication implementation

* wip

* Refactor hvx_vsf_convert_vhf to accept an additional parameter for flexibility in vector multiplication

* Refactor vec_dot_product_mix_aligned_impl to improve variable naming for clarity

* Refactor load_dual_block_generic and dequantize_row_q4_0 to improve performance

* Refactor vector operation functions to improve clarity and consistency in variable usage

* wip

* wip

* Refactor dequantize_row_q4_0_impl for improved clarity and performance in vector operations

* wip

* Update load_dual_block_generic to use intrinsics

* Refactor load_dual_block_generic and load_qual_block_generic for improved performance and clarity

* wip

* wip

* Optimize dequantize_row_q8_0 for improved performance by unrolling for loop

* wip

* wip

* fix typo
# Conflicts:
#	ggml/src/ggml-backend-reg.cpp
# Conflicts:
#	ggml/CMakeLists.txt
Labels
build (Compilation issues), ggml (changes relating to the ggml tensor library for machine learning)

9 participants